Skip to content

Conversation

@ermezavx
Copy link
Contributor

@ermezavx ermezavx commented Sep 3, 2025

Fixes #1080

This PR addresses the timing issue where NVMe namespace devices may not be immediately available after controller creation, causing intermittent test failures.

Problem

The current implementation performs immediate device lookup without considering that NVMe devices may need time to appear in the system after target creation and connection.

Solution

Instead of the simple sleep(1) approach proposed in #1081, this implements an intelligent polling mechanism with exponential backoff:

Key Features

  • Smart Polling: 100ms initial delay with 1.2x exponential backoff (capped at 500ms)
  • Reasonable Timeout: 5-second default timeout (configurable)
  • Device Readiness: Includes udevadm settle to ensure proper device initialization
  • Early Return: Returns immediately when devices are found
  • Backward Compatible: No breaking changes to existing API

tests/utils.py Outdated
decoded = decoder.decode(out)
if not decoded or 'Devices' not in decoded:
return []
# Poll for devices with exponential backoff
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but this so unnecessarily overengineered, especially for tests. Simple while with few re-tries and time.sleep(1) is enough.

tests/utils.py Outdated
if ret != 0:
return []

try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is out of scope for this change, but if you are already changing these function, it would be worth removing the code duplication here.

@tbzatek
Copy link
Member

tbzatek commented Sep 3, 2025

The best approach is to wait for a live controller status in the respective sysfs path first and then wait for udev to finish probing the device.

@ermezavx ermezavx force-pushed the fix-issue-1080-can-not-get-nvme-ns-immediately branch from b19969e to 951643e Compare September 4, 2025 03:46
@ermezavx
Copy link
Contributor Author

ermezavx commented Sep 4, 2025

The best approach is to wait for a live controller status in the respective sysfs path first and then wait for udev to finish probing the device.

I have made the changes according to your suggestions. Please help review it. Thank you.

@ermezavx ermezavx force-pushed the fix-issue-1080-can-not-get-nvme-ns-immediately branch from 951643e to 2cc8e63 Compare September 5, 2025 03:20
@ermezavx ermezavx requested a review from tbzatek September 6, 2025 03:26
Copy link
Member

@tbzatek tbzatek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from @vojtechtrefny's suggestions the logic looks good to me. There are multiple possible points of failure, but let's see if it helps.

tests/utils.py Outdated
pass

time.sleep(wait_time)
wait_time = min(wait_time * 1.2, max_wait) # Slower exponential backoff
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original comment about this being unnecessarily complicated is still valid here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will update and submit a new version of the code based on your comments.

tests/utils.py Outdated
wait_time = min(wait_time * 1.2, max_wait) # Slower exponential backoff

os.system("udevadm settle")
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this function returns boolean? It's not checked anywhere where it is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, there is no need to return a boolean type.

Fixes storaged-project#1080

- Remove complex exponential backoff in favor of simple 1-second sleep
- Remove unnecessary boolean return value
- Keep subsystem NQN filtering for precise controller monitoring
- Use reasonable 3-second timeout for test efficiency
@ermezavx ermezavx force-pushed the fix-issue-1080-can-not-get-nvme-ns-immediately branch from 2cc8e63 to 0b42cbd Compare September 13, 2025 02:29
@vojtechtrefny
Copy link
Member

Jenkins, ok to test.

@vojtechtrefny vojtechtrefny merged commit 515f99b into storaged-project:master Sep 16, 2025
50 checks passed
@tbzatek tbzatek mentioned this pull request Sep 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

can not get nvme ns immediately after create nvme ctrl

3 participants